6 research outputs found

    WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models

    Full text link
    Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources, and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a novel method -- called WECHSEL -- to efficiently and effectively transfer pretrained LMs to new languages. WECHSEL can be applied to any model which uses subword-based tokenization and learns an embedding for each subword. The tokenizer of the source model (in English) is replaced with a tokenizer in the target language, and token embeddings are initialized such that they are semantically similar to the English tokens, utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer the English RoBERTa and GPT-2 models to four languages (French, German, Chinese and Swahili). We also study the benefits of our method on very low-resource languages. WECHSEL improves over previously proposed methods for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.
    Comment: NAACL 2022
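
    The initialization step admits a compact sketch. The following is a minimal illustration, not the authors' released implementation: it assumes aligned static word embeddings for the source and target subword vocabularies are already available (e.g. fastText vectors aligned via a bilingual dictionary), and it initializes each target embedding as a similarity-weighted mixture of the source model's embeddings. All names and hyperparameters are illustrative.

    import numpy as np

    def init_target_embeddings(src_emb, src_static, tgt_static, k=10, temp=0.1):
        # src_emb:    (V_src, d_model) trained subword embeddings of the source LM
        # src_static: (V_src, d_static) aligned static embeddings of source subwords
        # tgt_static: (V_tgt, d_static) aligned static embeddings of target subwords
        # returns:    (V_tgt, d_model) initialization for the target LM's embeddings
        src_n = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
        tgt_n = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
        sim = tgt_n @ src_n.T                      # cosine similarities (V_tgt, V_src)

        tgt_emb = np.empty((tgt_static.shape[0], src_emb.shape[1]))
        for i in range(tgt_emb.shape[0]):
            top = np.argpartition(-sim[i], k)[:k]  # k most similar source subwords
            w = np.exp(sim[i, top] / temp)
            w /= w.sum()                           # softmax over the top-k scores
            tgt_emb[i] = w @ src_emb[top]          # similarity-weighted mixture
        return tgt_emb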

    Semantic HELM: A Human-Readable Memory for Reinforcement Learning

    Full text link
    Reinforcement learning agents deployed in the real world often have to cope with partially observable environments. Therefore, most agents employ memory mechanisms to approximate the state of the environment. Recently, there have been impressive success stories in mastering partially observable environments, mostly in the realm of computer games like Dota 2, StarCraft II, or Minecraft. However, existing methods lack interpretability in the sense that it is not comprehensible for humans what the agent stores in its memory. In this regard, we propose a novel memory mechanism that represents past events in human language. Our method uses CLIP to associate visual inputs with language tokens. Then we feed these tokens to a pretrained language model that serves the agent as memory and provides it with a coherent and human-readable representation of the past. We train our memory mechanism on a set of partially observable environments and find that it excels on tasks that require a memory component, while mostly attaining performance on par with strong baselines on tasks that do not. On a challenging continuous recognition task, where memorizing the past is crucial, our memory mechanism converges two orders of magnitude faster than prior methods. Since our memory mechanism is human-readable, we can peek at an agent's memory and check whether crucial pieces of information have been stored. This significantly enhances troubleshooting and paves the way toward more interpretable agents.
    Comment: To appear at NeurIPS 2023, 10 pages (+ references and appendix), Code: https://github.com/ml-jku/hel
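
    A rough sketch of the retrieval-then-memory pipeline follows. The token vocabulary, model choices, and wiring are assumptions for illustration, not the paper's exact setup: CLIP scores each observation against a small set of textual descriptions, and the best-matching words are fed to a frozen language model whose hidden state serves as the agent's readable memory.

    import torch
    from transformers import CLIPModel, CLIPProcessor, GPT2Model, GPT2Tokenizer

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    lm = GPT2Model.from_pretrained("gpt2").eval()
    lm_tok = GPT2Tokenizer.from_pretrained("gpt2")

    vocab = ["a key", "a door", "a wall", "a ball"]   # assumed description set
    with torch.no_grad():
        text_in = proc(text=vocab, return_tensors="pt", padding=True)
        text_emb = clip.get_text_features(**text_in)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    def observation_to_words(image, k=2):
        # Return the k vocabulary entries closest to the image in CLIP space.
        with torch.no_grad():
            img_in = proc(images=image, return_tensors="pt")
            img_emb = clip.get_image_features(**img_in)
            img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
            sims = (img_emb @ text_emb.T).squeeze(0)
        return [vocab[i] for i in sims.topk(k).indices]

    def memory_representation(word_history):
        # Feed the readable history into the frozen LM; its final hidden state
        # acts as the compressed, human-readable memory of the past.
        ids = lm_tok(" ".join(word_history), return_tensors="pt")
        with torch.no_grad():
            return lm(**ids).last_hidden_state[:, -1]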

    Learning to Modulate pre-trained Models in RL

    Full text link
    Reinforcement Learning (RL) has been successful in various domains like robotics, game playing, and simulation. While RL agents have shown impressive capabilities in their specific tasks, they adapt poorly to new tasks. In supervised learning, this adaptation problem is addressed by large-scale pre-training followed by fine-tuning on new downstream tasks. Recently, pre-training on multiple tasks has been gaining traction in RL. However, fine-tuning a pre-trained model often suffers from catastrophic forgetting: the performance on the pre-training tasks deteriorates when fine-tuning on new tasks. To investigate the catastrophic forgetting phenomenon, we first jointly pre-train a model on datasets from two benchmark suites, namely Meta-World and DMControl. Then, we evaluate and compare a variety of fine-tuning methods prevalent in natural language processing, both in terms of performance on new tasks and how well performance on the pre-training tasks is retained. Our study shows that with most fine-tuning approaches, the performance on the pre-training tasks deteriorates significantly. Therefore, we propose a novel method, Learning-to-Modulate (L2M), that avoids the degradation of learned skills by modulating the information flow of the frozen pre-trained model via a learnable modulation pool. Our method achieves state-of-the-art performance on the Continual-World benchmark, while retaining performance on the pre-training tasks. Finally, to aid future research in this area, we release a dataset encompassing 50 Meta-World and 16 DMControl tasks.
    Comment: 10 pages (+ references and appendix), Code: https://github.com/ml-jku/L2
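
    A hedged PyTorch sketch of the learning-to-modulate idea: keep the pretrained weights frozen and train only small modulation vectors that rescale intermediate activations. The pool lookup and placement below are assumptions for illustration, not the paper's exact design.

    import torch
    import torch.nn as nn

    class ModulatedLinear(nn.Module):
        def __init__(self, pretrained: nn.Linear, pool_size: int = 8):
            super().__init__()
            self.base = pretrained
            for p in self.base.parameters():     # freeze the pretrained weights
                p.requires_grad = False
            d = pretrained.out_features
            # Pool of learnable modulation vectors plus keys for selecting them
            self.pool = nn.Parameter(torch.ones(pool_size, d))
            self.keys = nn.Parameter(torch.randn(pool_size, d))

        def forward(self, x):
            h = self.base(x)                     # frozen computation
            # Soft-select a modulation vector by similarity to the activation
            attn = torch.softmax(h.detach() @ self.keys.T, dim=-1)
            gamma = attn @ self.pool             # (batch, d)
            return h * gamma                     # modulate the information flow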

    Reactive Exploration to Cope with Non-Stationarity in Lifelong Reinforcement Learning

    Full text link
    In lifelong learning, an agent learns throughout its entire life without resets, in a constantly changing environment, as we humans do. Consequently, lifelong learning comes with a plethora of research problems such as continual domain shifts, which result in non-stationary rewards and environment dynamics. These non-stationarities are difficult to detect and cope with due to their continuous nature. Therefore, exploration strategies and learning methods are required that are capable of tracking the steady domain shifts and adapting to them. We propose Reactive Exploration to track and react to continual domain shifts in lifelong reinforcement learning, and to update the policy correspondingly. To this end, we conduct experiments to investigate different exploration strategies. We empirically show that representatives of the policy-gradient family are better suited for lifelong learning, as they adapt more quickly to distribution shifts than Q-learning. Policy-gradient methods also profit the most from Reactive Exploration and show good results in lifelong learning with continual domain shifts. Our code is available at: https://github.com/ml-jku/reactive-exploration.
    Comment: CoLLAs 2022
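
    One way to obtain such a tracking signal is a curiosity-style forward model, sketched minimally below (the architecture and reward scaling are illustrative, not the paper's exact design): the model's prediction error serves as an intrinsic reward, and a sustained rise in it flags a domain shift that should trigger renewed exploration.

    import torch
    import torch.nn as nn

    class ForwardModel(nn.Module):
        def __init__(self, obs_dim, act_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, obs_dim))

        def forward(self, obs, act):
            # Predict the next observation from the current one and the action
            return self.net(torch.cat([obs, act], dim=-1))

    def intrinsic_reward(model, obs, act, next_obs, beta=0.1):
        # Prediction error as an exploration bonus; after an environment shift
        # this error rises, pushing the agent to re-explore and adapt.
        with torch.no_grad():
            err = (model(obs, act) - next_obs).pow(2).mean(dim=-1)
        return beta * err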

    Improving Generalization of Deep Convolutional Neural Networks for Acoustic Scene Classification

    No full text
    In recent years, deep learning has become one of the most popular machine learning techniques for a vast variety of complex problems. An example of such a task is to mimic the human auditory system and classify audio recordings according to the location they were recorded in. This work focuses mainly on the Acoustic Scene Classification task proposed by the IEEE DCASE Challenge. The dataset for Acoustic Scene Classification consists of recordings from distinct recording locations, and the aim of the challenge is to classify an unseen test set of recordings. In the 2016 challenge, the training and test sets did not differ significantly. In the 2017 challenge, however, the test set originated from a different distribution, implying a strong need for generalization. In the course of this work, the initial implementation, a Deep Convolutional Neural Network for the DCASE 2016 challenge submission (done in Lasagne), was re-implemented in Keras. An extension of the ADAM optimizer (AMSGrad) was investigated for improvements in generalization. Other submissions to the DCASE 2017 challenge suggest that different types of spectrograms might be key to better generalization; therefore, experiments utilizing different kinds of spectrograms were conducted. Furthermore, different interpolation algorithms were used for data augmentation, with some of them yielding significant improvements in classification accuracy and generalization. For different spectrogram dimensions, slight adjustments in the network architecture also resulted in a performance gain. To better understand what different models "see" and what they focus on, their filters and activations were visualized and compared for differences. Finally, the adjustments which led to better generalization on the DCASE 2016 dataset were tested on the DCASE 2017 dataset, leading to an improvement over all submissions to the DCASE 2017 challenge from the Institute of Computational Perception.
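
    Two of the ingredients mentioned above are easy to illustrate (the toy network and input shape below are assumptions, not the submission's architecture): log-scaled mel spectrograms as input features, and the AMSGrad variant of ADAM, which Keras exposes as a single optimizer flag.

    import librosa
    import numpy as np
    import tensorflow as tf

    def mel_spectrogram(path, n_mels=128):
        # Load the recording and compute log-scaled mel bands as input features
        y, sr = librosa.load(path, sr=None)
        spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return librosa.power_to_db(spec, ref=np.max)

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(128, 431, 1)),  # illustrative shape
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(15, activation="softmax"),    # 15 DCASE 2016 scenes
    ])
    # AMSGrad is enabled with a single flag on the Adam optimizer in Keras
    model.compile(optimizer=tf.keras.optimizers.Adam(amsgrad=True),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])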

    History Compression via Language Models in Reinforcement Learning

    Full text link
    In a partially observable Markov decision process (POMDP), an agent typically uses a representation of the past to approximate the underlying MDP. We propose to utilize a frozen Pretrained Language Transformer (PLT) for history representation and compression to improve sample efficiency. To avoid training the Transformer, we introduce FrozenHopfield, which automatically associates observations with pretrained token embeddings. To form these associations, a modern Hopfield network stores the token embeddings, which are retrieved by queries obtained from a random but fixed projection of observations. Our new method, HELM, enables actor-critic network architectures that contain a pretrained language Transformer for history representation as a memory module. Since a representation of the past need not be learned, HELM is much more sample efficient than competitors. On Minigrid and Procgen environments, HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm.
    Comment: ICML 2022
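
    The FrozenHopfield association admits a compact sketch. The dimensions, the inverse temperature beta, and the projection scaling below are illustrative assumptions: a fixed random projection maps the observation into token-embedding space, and one step of modern Hopfield retrieval, i.e. softmax attention over the frozen token embeddings, yields the pseudo-token passed on to the language Transformer.

    import torch

    torch.manual_seed(0)
    vocab_size, d_model, obs_dim, beta = 50257, 768, 3 * 64 * 64, 8.0

    token_emb = torch.randn(vocab_size, d_model)  # stand-in for frozen GPT-2 embeddings
    proj = torch.randn(d_model, obs_dim) / obs_dim**0.5  # random but fixed, never trained

    def frozen_hopfield(obs_flat):
        # obs_flat: (batch, obs_dim) flattened observation
        query = obs_flat @ proj.T                 # project into embedding space
        # One-step modern Hopfield retrieval = softmax attention over stored patterns
        attn = torch.softmax(beta * query @ token_emb.T, dim=-1)
        return attn @ token_emb                   # convex mix of token embeddings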